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Abstract 

We describe a PCA-based genome scan approach to analyze genome-wide admixture structure, and introduce 
wavelet transform analysis as a method for estimating the time of admixture. We test the wavelet transform 
method with simulations and apply it to genome-wide SNP data from eight admixed human populations. The 
wavelet transform method offers better resolution than existing methods for dating admixture, and can be applied 
to either SNP or sequence data from humans or other species. 



Background 

An admixed population arises when individuals from 
two or more distinct populations start exchanging 
genetic material. Studying admixed populations can be 
particularly useful for understanding differences in dis- 
ease prevalence and drug response among different 
populations. There is ample evidence that human popu- 
lations have different susceptibility to diseases, exhibit- 
ing substantial variation in risk allele frequencies [1]. 
For example, genetic predisposition to asthma differs 
among the differentially-admixed Hispanic populations 
of the United States, with the highest prevalence 
observed in Puerto Ricans. Genetic variants responsible 
for the increased asthma prevalence in this population 
were localized using an admixture mapping approach 
[2]. This method allows the identification of disease 
causing variants by estimating ancestry along the gen- 
ome, and narrowing the search to the genomic regions 
with ancestry from a population that has a greater risk 
for the disease [3,4]. The same approach was used to 
identify genetic loci that influence susceptibility to obe- 
sity, which is about 1.5-fold more prevalent in African- 
Americans than in European- Americans [5]. 

Admixed populations are also of interest to population 
geneticists as they offer invaluable insights into the 
impact of various human migrations. For example, Poly- 
nesian populations are of dual Melanesian and Austro- 
nesian ancestry, with more maternal Austronesian and 
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paternal Melanesian ancestry, highlighting the impor- 
tance of sex-specific processes in human migrations [6]. 
The analysis of the pattern of sharing of chromosomal 
regions between populations has provided important 
insights into human colonization history including mul- 
tiple migration waves into the Americas, and a complex 
movement of people across Europe [7]. A study of 
admixture patterns in Indian populations revealed that 
most Indians today trace their ancestry to two ancient, 
genetically-divergent populations [8]. 

Analyses of admixture patterns in human populations 
have also proven useful for studies of local selection. 
The genomewide distribution of ancestry has been 
examined and signals of recent selection have been 
identified in admixed populations of Puerto Ricans [9] 
and African Americans [10]. 

Over the years various methods have been developed 
to study genetic ancestry both at the level of an entire 
population [11], and at the level of individuals within 
admixed populations [4,10,12-17]. Because genetic 
recombination breaks down parental genomes into seg- 
ments of different sizes, the genome of a descendant of 
an admixture event is composed of different combina- 
tions of these ancestral segments, or 'blocks'. The distri- 
bution of ancestry proportions within a population and 
the structure of an admixed genome can thus provide 
information on the timing of the admixture event itself. 
Previously, a likelihood-based method (HAPMIX) was 
developed to infer the time of admixture events from 
the haplotype block information [17]. Here we introduce 
a PCA-based genome scan approach to detect and date 
admixture events. Stepwise principal component analysis 



© 201 1 Pugach et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons 
BiolVlGCl C6ntTcll Attribution License (http://creativecommons.Org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in 
any medium, provided the original work is properly cited. 



Pugach et al. Genome Biology 201 1, 12:R19 
http://genomebiology.eom/content/1 2/2/R1 9 



Page 2 of 18 



is carried out along each chromosome of an admixed 
individual and respective parental populations, and spec- 
tral decomposition of the resulting signal is used to infer 
the date of admixture. We validate the method on simu- 
lated data sets, and on a sample of African Americans, 
as a population with known admixture history. To test 
how the performance of our approach compares to 
HAPMIX, which uses a fundamentally different metho- 
dology to infer local ancestry, we apply our method to 
the Human Genome Diversity Panel (HGDP) popula- 
tions [18] for which European admixture has been esti- 
mated and dated using HAPMIX [17]. Finally, we apply 
the method to elucidate the structure and admixture 
proportions, and estimate admixture time, in a Fijian 
population and in a diverse sample of Polynesians [19]. 

Results and discussion 

Overview of the method 

The idea behind the method is straightforward: when 
two populations admix, genetic recombination starts 
breaking 'ancestral' genomes into blocks of different 
sizes, so that the genomes of the descendants of an 



admixture event are composed of different combinations 
of these ancestral blocks (Figure 1). Hence, by screening 
the genome of an individual of mixed ancestry, we iden- 
tify stretches of the genome which are inherited from 
either of the ancestral populations. Moreover, the struc- 
ture of an admixed genome contains information on the 
timing of the admixture event itself. The number of 
admixture blocks reflects past recombination events, 
and similarly the width of such blocks also contains 
temporal information, as more recombination events 
would result in narrower blocks that are more evenly 
spread along and among chromosomes. 

To analyze local genomic admixture structure for an 
individual in an admixed population and to use such 
structure to infer the date of admixture we introduce a 
two-part method. The first part of the method, named 
StepPCO, is an extension of principal component analy- 
sis (PCA) and is used to obtain a signal of admixture 
from an individual genome. The second part of the 
method relies on the wavelet decomposition of this 
admixture signal to extract information about the date 
of the admixture event. Here we provide a descriptive 



Population A 



Population B 
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Figure 1 Diagram giving an overview of the admixture process. When two populations admix, genetic recombination starts breaking 
ancestral genomes into blocks of different sizes, so that the genomes of the descendants of an admixture event are composed of different 
combinations of these ancestral blocks. The number and width of the admixture blocks contain information about the time since admixture, as 
more recombination events result in a greater number of blocks, which with time get progressively narrower and more evenly spread along and 
among chromosomes. 
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overview of the method; the actual methodology is for- 
mally developed in the Materials and Methods section 
below. 

We start by performing a sequential stepwise PCA 
(StepPCO) along each chromosome of an individual 
from an admixed population and of individuals from 
respective parental populations. We consider an 
admixed population as a mixture of two ancestral popu- 
lations, in which the admixture occurred at a single 
timepoint, and assume that no genetic drift occurred 
after the admixture event. These, of course, are simplify- 
ing assumptions, as most human populations are 
expected not only to have many incidents of admixture 
occurring at different points in time, and between differ- 
ent populations, but also to experience genetic drift rela- 
tive to the parental populations. We try to circumvent 
this issue by finding the first principal axis (PA1) based 
on the samples from the proposed ancestral populations 
or their proxies, and then project the admixed dataset 
onto the axis of variation defined by these ancestral 
populations, thereby excluding any signal which poten- 
tially could originate from drift and/or other sources of 
ancestry [8,20]. We then consider a sliding window 
along each chromosome. The size of this window is not 
fixed, but at each position is determined by the statisti- 
cal properties of the collection of SNPs in the window. 
We take evenly spaced points along each chromosome 
(evenly spaced in terms of genetic distances); and each 
point serves as center for the next window. The number 
of points (windows) is chosen so that the windows span 
the entire chromosome, leaving no gaps in between. To 
simplify subsequent wavelet transform analysis, we also 
want the number of windows (or bins) equal to a power 
of two. Starting from the center of each window, we 
increase the window until the mean PCI coordinates for 
the parental populations are separated by three standard 
deviations from each mean. The goal is to achieve a 
complete separation of the parental populations within 
each window, so there is no ambiguity in assigning 
chromosomal segments in an admixed genome to either 
ancestral population. Because human populations are 
closely related, there is an obvious trade-off between the 
signal resolution and uncertainty in ancestry estimation; 
by making the size of the window variable and depen- 
dent on the number of informative sites within a given 
chromosomal region, we always find the smallest possi- 
ble window that gives us optimal signal resolution with- 
out introducing excess errors into the ancestry 
estimation. Using PA1 coefficients as weights, we find 
the average value of SNPs within each window. The 
resulting values are then normalized, so that the ances- 
tral populations correspond to values with means of 1 
and -1, respectively. Thus, for each individual we obtain 
a value for each of the windows, and the windows are 



evenly spaced along the chromosome. For an admixed 
individual, the value in each window will either corre- 
spond to one of the ancestral populations, or have an 
intermediate value corresponding to having one chro- 
mosomal segment from each ancestral population (we 
use unphased data, as phasing at the level of an entire 
chromosome infers haplotypes with significant phasing 
(switch) errors [14,21], making such data unusable for 
time since admixture estimation). Thus for each indivi- 
dual and each chromosome we obtain a StepPCO signal, 
consisting of a sequence of values along the given chro- 
mosome. This part of the method is similar to a recently 
published approach [10], in which local genomic admix- 
ture estimates are inferred using PC analysis on a grid 
of points along the genome (and not genome-wide); 
unlike our method this approach works with very small 
windows of 15 SNPs, and requires a Hidden Markov 
Model (HMM) to infer ancestry state within each win- 
dow. Our implementation is also different in that we 
not only estimate the local genomic level of admixture, 
but also use the identified ancestry block structure to 
date admixture events. 

As mentioned earlier, because most SNPs are not 
fixed between human populations, it is necessary to use 
relatively large windows in order to have enough power 
to reliably assign chromosomal segments to an ancestral 
population. Large windows mean that the exact location 
and width of ancestral blocks in empirical data is diffi- 
cult to determine, because small but informative blocks 
may be missed, while larger blocks that are actually 
noise may be inflated and falsely considered a true sig- 
nal. Therefore rather than to attempt a direct estimation 
of the number of breakpoints [17] we have developed a 
method based on spectral analysis of the signal using 
Haar wavelets [22]. The wavelet transform represents 
the StepPCO signal (described above), as the sum of 
simple waves, each characterized by frequency (or per- 
iod), and position. These wave frequencies are then used 
as a measure of the width of the ancestral blocks. There 
are several advantages to the wavelet transform 
approach. First, wavelet transform of the discrete signal 
is lossless, and describes the data completely [22]. That 
is, wavelet transform coeffiicients could be used to 
recover the original signal exactly. Second, wavelet 
transform also allows for the reduction of noise, which 
in this context is defined as high frequency or low 
amplitude oscillations that are not informative but may 
falsely be considered as true signals. By removing from 
the analysis wavelet coefficients corresponding to the 
high frequency or low amplitude waves within the sig- 
nal, we are able to greatly reduce the noise and distill 
the admixture signal contained within the data. Finally, 
the dominant frequency present in the signal is related 
to the average width of the admixture blocks, and can 
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therefore be used to infer the time of admixture (see 
Materials and Methods, Wavelet Transform section for 
details). 

Since the recombination rate is uneven along the 
chromosome, with 80% of recombination events in 
humans occurring within hotspots [23], to measure dis- 
tances along the chromosome we use genetic map dis- 
tances (measured in cM) rather than physical distances 
(measured in base pairs). We interpolate genetic dis- 
tances from genome-wide recombination rates estimated 
as part of the HapMap project [24]. 

Simulations 

Initial validation of the method was done using an in- 
house forward simulation approach. We start with two 
distinct populations (A and B), simulating chromosomes 
as an interval from zero to one. We choose the recom- 
bination rate to be 2.78 events per chromosome per 
generation, which corresponds to the recombination 
rate observed for human chromosome 1 [25,26]. At 
time T 0 the effective population size of population A 
equals 1,000 individuals, and it receives either 1%, 5%, 
10%, 20%, 30% or 40% migrants from population B. The 
simulation then runs forward for 2,000 generations; the 
population at each generation is split randomly into 
pairs and each pair produces a random number of off 
spring, drawn from the Poisson distribution, with the 
average depending on the specified growth rate. The 
growth rate is chosen so that the population grows from 
an effective size of 1,000 to 10,000 in 2,000 generations. 
Since we are only interested in the dynamics of recom- 
bination, we only keep track of the recombination 
points, with their coordinates along the chromosome 
given as percentages of the total length of the chromo- 
some. This significantly reduces computational time and 
makes modeling of the recombination dynamics feasible. 
We ran independent sets of simulations using either the 
genetic map [25], or the physical position map with 
variable rates of recombination along the chromosome, 
using previously-described parameter values for the 
strength and spacing of hotspots [27]. The recombina- 
tion map was generated at the beginning of each simula- 
tion run. We ran 100 simulations for each of the 
migration parameters, and from each simulation we 
sampled 100 chromosomes at exponentially growing 
time points, and collected statistics on the total amount 
of admixture, the number of breakpoints, and the width 
of admixture blocks (as measured by the wavelet trans- 
form coefficients) for each chromosome in each sampled 
generation. 

The overall admixture rate estimates we obtain are 
highly concordant with the migration rate parameter 
initially set for each simulation, and this estimation is 
not influenced by the time since admixture (Figure 2a). 



The number of breakpoints vs. time (Figure 2b) is 
almost linear, with little oscillation within each simula- 
tion. There seems to be a stochastic period immediately 
following the admixture event, when random processes 
appear to strongly influence the slope. In general, the 
number of breakpoints grows faster with higher admix- 
ture rates (Figure 2b). Up to about 50 generations, the 
number of observed breakpoints closely matches the 
expected value: 

iV bkpts = 2T gen R(Ea(i -Ed)- var a), (1) 

where N stands for the number of breakpoints, T gen 
denotes time since admixture in generations, R corre- 
sponds to the number of recombination events per gen- 
eration, a denotes the admixture rate for an individual, 
and Ea and var a are the mean and the variance of a. 
The deviation of the observed number of breakpoints 
from that expected after 50 generations (Figure 2b) is 
due to the fact that infinite populations the pattern of 
ancestral blocks (their width and distribution along 
chromosomes) becomes more uniform with time, that is 
the recombination events are no longer independent. 

For the calculation of the wavelet transform (WT) 
coefficients, simulated chromosomes were randomly 
paired to form diploids to match the empirical data. 
Also, from the calculated WT coefficients we exclude all 
coefficients describing high frequency wavelets (WT 
levels higher than level seven, as described in the Mate- 
rials and methods, Wavelet transform section) and nor- 
malize for the length of the chromosome by subtracting 
the log of the chromosome length, which would corre- 
spond to the threshold and normalization imposed on 
the empirical data for chromosome 1 (as the simulated 
chromosomes were of the same length as chromosome 
1). The distributions of the WT levels, indicating how 
the wavelet transform spectrum changes with time since 
admixture, are presented in Figure 3. With time the 
center of the WT spectrum shifts from left to right, 
from predominantly low to predominantly high fre- 
quency wavelets. In Figure 2c the WT centers calculated 
for different time points are plotted against time since 
admixture. The centers increase exponentially with time, 
are fairly independent of admixture rate, and are very 
consistent across simulations, especially if the admixture 
rate is over 1%. This measure starts to level off at 
approximately 400 generations since admixture, which is 
due to the elimination of levels containing wavelets of 
highest frequency (done to concur to the filtering 
applied to the empirical data). For the empirical data, 
this removal of high-frequency wavelets is done to 
remove noise, which in turn reflects the relatively low 
density of informative SNPs present in the data; with 
more dense SNP data (or full sequence data), we would 
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Figure 2 Data from 100 simulations for migration values of 1%, 5%, 10% 7 20%, 30%, and 40%. Each curve represents a single admixed 
population. To generate the plots, 100 chromosomes were sampled from each population at exponentially growing time points, and the 
following statistics for each chromosome in each sampled generation were collected: (a) admixture rate; (b) number of breakpoints; black lines 
indicate the expected value, given by: /V bkpts = 27 gen /?(to(1 - Ea) - var a); and (c) the WT centers. Inset: Average number of breakpoints for each 
simulation parameter. Black lines indicate the expected value. 

V J 



expect to have more power to detect more ancient 
admixture events. 

Sensitivity of the method to smaller effective population 
size and continuous migration 

To test the sensitivity of our method with respect to the 
initial effective population size, we ran additional simula- 
tions where the effective size of population A at T 0 equals 
500 or 200 individuals, and it receives either 5%, 10% or 
20% migrants from population B. The growth rate of the 
new admixed population was chosen so that in 2,000 gen- 
erations the population grows to 10,000 or 2,000 indivi- 
duals, respectively. Results are shown in Figure SI in 
Additional file 1. There is no influence of initial population 
size on the performance of the method for admixture 
times up to about 20 generations ago. For populations 
with small Ne and admixture events older than 20 genera- 
tions ago, our approach will tend to overestimate the date 
of admixture for more recent events, and not be able to 
detect more ancient events. Apparently, the diversity in 



the distribution of ancestry blocks diminishes (stabilizes) 
faster in smaller populations, thus making new recombina- 
tion events undetectable. The same phenomenon is 
responsible for the deviation of the number of recombina- 
tion breakpoints observed in our simulations from the 
value predicted by Equation 1. Our results further suggest 
that it is not so much the small Ne, but rather the growth 
rate of the population, which is primarily responsible for 
these deviations. Moreover, the effect is more pronounced 
when the admixture rate is low. 

Additionally, we ran simulations to test how continu- 
ous admixture over time affects the method. Again we 
start with a population A, which at T 0 comprises 1,000 
individuals, and it receives either 5% or 20% migrants 
from population B over the period of either 10 or 30 
generations. The growth rate of the new admixed popu- 
lation was chosen so that in 2,000 generations the popu- 
lation grows to 10,000, and 100 simulations were 
performed for each scenario. Results are shown in 
Figure S2 in Additional file 1. Because new ancestry 
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Figure 3 Distributions of the WT levels, illustrating how the wavelet transform spectrum changes with time since admixture. For each 
illustrated time point, WT levels from 10 randomly chosen simulations are plotted (each bar represents one simulation, resulting in 10 bars for 
each level). The height of the columns indicate the abundance of wavelets of particular frequency present in the signal, starting with the lowest 
wave frequencies (widest recombination blocks) on the left and progressing towards the highest wave frequencies (narrowest recombination 
blocks) on the right. The WT centers in this plot are not adjusted for chromosome length, and thus appear to be higher than the values we 
present for genomewide data. 



blocks are being continuously introduced over the per- 
iod of either 10 or 30 generations, potentially removing 
older block structure by replacing narrower ancestry 
blocks with new wider blocks, we expect ongoing 
admixture to reduce the wavelet transform coefficients 
and therefore lead to an underestimation of time since 
admixture. This is indeed what we observe: irrespective 
of the admixture rate throughout the duration of admix- 
ture (10 or 30 generations), the wavelet transform coeffi- 
cients are lower than those observed in a population 
with the same admixture rate, but which has experi- 
enced not a continuous but a one-time admixture event. 
Once the influx of new genetic material into the simu- 
lated population A stops, the trajectory of growth of the 
wavelet transform coefficients is slowly recovered. 

Sensitivity of the method to levels of linkage 
disequilibrium 

As described in the Overview of the Method, to measure 
distances along the chromosome we use genetic map 



distances (measured in cM) rather than physical distances 
(measured in bp). As we measure distances in units of 
recombination frequency, chromosomal regions with high 
LD, that is low propensity towards recombination, will 
span smaller distances and be represented by a smaller 
number of windows, and conversely genomic regions that 
harbor recombination hotspots will be inflated and repre- 
sented by a larger number of windows. We therefore do 
not expect levels of LD to affect our results. To demon- 
strate this we have measured LD [28] in fixed windows 
across chromosomes 6 and 8 in three HAPMAP popula- 
tions: individuals of European ancestry (CEU), the Yoruban 
(YRI) individuals and the individuals of African ancestry 
from the Southwestern USA (ASW). Genotype data were 
downloaded from the International HapMap project home 
page [29]. The size of the fixed window was chosen as a 
fraction of a chromosome length to correspond on average 
to either 500 kb or to 0.5 cM. In accordance with our 
expectations we observed that the level of LD varies along 
the chromosome if the distances are measured in base 
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pairs, but does not vary as much if the distances are mea- 
sured in cM (Figure S3 in Additional file 1). 

Sample size estimation 

For some of the populations considered in this study the 
sample size was limited to 25 individuals. To ensure that 
the StepPCO method has adequate power, we therefore 
calculated how large a sample size is required for accu- 
rate and reliable estimates of admixture time. We 
sampled from 1 to 50 individuals at random from one 
randomly-chosen simulated population at 12 different 
time points. Average WT centers, based on different 
sample sizes, were calculated and used to infer time 
since admixture by comparing the observed WT centers 
to those obtained using the entire simulated dataset. 
The results (Figure S4 in Additional file 1) indicate that 
a sample size of 10 is sufficient for quite accurate time 
estimation with narrow confidence intervals up to about 
200 generations ago. Point estimates become less pre- 
cise, and confidence intervals become much wider, at 
time points exceeding 500 generations ago. This is 
caused by the threshold imposed on the simulated data 
to concur with the same limitation that is present in the 
empirical data, due to the elimination of levels contain- 
ing wavelets of highest frequency (removal of noise, as 
described above). 

Comparison to HAPMIX: simulated data 

Various methods have been developed to quantify the 
admixture signal along individual chromosomes, such as 
ANCESTRYMAP [4], SABER [14], LAMP and LAMP- 
ANC [15], uSWITCH and uSWITCH-ANC [16], and 
HAPMIX [17]. To test the performance of the StepPCO 
approach relative to these other programs, we chose to 
compare the method only to HAPMIX, as this approach 
has been shown to perform better relative to the other 
methods in predicting ancestry transitions, especially for 
smaller ancestry segments which carry information on 
more ancient admixture events [17]. 

To compare to HAPMIX, we constructed an artificially 
admixed dataset from the phased genotypes of the Yoru- 
ban (YRI) individuals and individuals of European ances- 
try (CEU), downloaded from the International HapMap 
project home page [30]. Forty haploid admixed genomes 
were constructed as described previously [17], namely for 
each simulated chromosome we randomly selected one 
haploid Yoruban and one haploid CEU genome, and 
built a recombination map by drawing from an exponen- 
tial distribution with weight A, such that the ancestry 
switch occurred with probability 1 - e ~ Xg for each distance 
of g Morgans. Starting at the beginning of each chromo- 
some and at each of the recombination points from the 
recombination map, we sampled European ancestry with 
probability a and African ancestry with probability 1 - a, 



where the value of a was sampled once from a beta dis- 
tribution with mean 0.20 and standard deviation 0.10, 
typical for African Americans [17]. We simulated the fol- 
lowing values of A: 6, 10, 20, 40, 60, 100, 200, 400. Once 
an artificial genome was constructed, parental chromo- 
somes were never reused. Pairs of the resulting artificially 
admixed haploid genomes were merged to create 20 
diploid admixed individuals. 

We then compared the performance of our spectral 
decomposition method to HAPMIX on these artificially 
admixed genomes. We ran a StepPCO analysis, followed 
by the WT decomposition of the resulting StepPCO 
admixture signal. To investigate how the dominant wave- 
let frequency is related to X for this artificial dataset, we 
generated a separate dataset of hybrids. Maps of recombi- 
nation events for each of these additional hybrid genomes 
for the values of X: 6, 10, 20, 40, 60, 100, 200, 400 were 
constructed as described in the previous paragraph. 20 
hybrids genomes were constructed for each value of X, 
and spectral decomposition analysis was carried out on 
the resulting admixture signal. Using the known number 
of breakpoints in these simulated hybrids, we have found 
that the dominant wavelet frequency is linearly related to 
the logarithm of the average (per Morgan) number of 
ancestry switches (breakpoints); linear regression was 
used to find the coefficients, and estimate the average 
number of ancestry switches per unit of genetic distance 
in the main simulated dataset. 

We also ran HAPMIX on the same artificially- 
admixed samples, using 40 haploid YRI and 40 haploid 
CEU genomes as the reference parental populations and 
using the input parameters recommended previously 
[17]. We calculated the number of ancestry switches 
detected by HAPMIX, as the output of HAPMIX pro- 
duces probability associated with each SNP genotype in 
the admixed genome. We then compared the true num- 
ber of ancestry switches per Morgan of genetic distance 
(known for simulated data) to the estimates produced 
by either HAPMIX or WT decomposition of the admix- 
ture signal. The results (Figure 4) show that HAPMIX 
consistently underestimates the number of breakpoints, 
while the estimates obtained by the WT analysis are 
more accurate, especially for higher values of X, typical 
of more ancient admixture events. 

However, accurate estimation of breakpoints does not 
imply accurate estimation of the admixture time. As 
demonstrated previously (Figure 2b), the number of 
breakpoints can deviate significantly from the value pre- 
dicted by Equation 1, especially with higher admixture 
rates, lower ancestral Ne, and/or older admixture times. 
Furthermore, inference of breakpoints requires transfor- 
mation of the 'raw' genomic signal into a discrete signal 
corresponding to the presence or absence of an ancestry 
switch, hence direct inference of number of breakpoints 
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Figure 4 Performance of Hapmix and wavelet transform analysis in recovering the average number of recombination breakpoints per 
Morgan of genetic distance from simulated data. The two methods were applied to 20 artificially admixed individuals, created using a 
genomewide average of 20% European and 80% African ancestry. For simulated data the average number of ancestry switches (or breakpoints) 
was drawn from an exponential distribution with weight X, such that the ancestry switch occurred with the probability 1 - e Xg for each distance 
of g Morgans. The following values of X were simulated: 6, 10, 20, 40, 60, 100, 200, 400. Since in the simulated genomes the true number of 
breakpoints is known, we show the accuracy of both methods in recovering this information. 



is inevitably error-prone. These errors, however small, 
will accumulate over the many measurements taken. 
WT analysis avoids such errors because rather than 
inferring the recombination events directly, the WT 
method compares the spectral properties of the given 
signal in the observed data with the properties of the 
model signal produced by simulations. 

Empirical data 

Quality-filtered genotypes for approximately one million 
SNPs for 25 Polynesians (PLY), 25 Fijians (FIJ), 23 



Borneans (BOR) and 25 individuals from the highlands 
of Papua New Guinea (MEL), typed with Affymetrix 6.0 
arrays, were obtained from a previous study [19] and are 
available from the authors upon request. Quality-filtered 
genotypes obtained with the Illumina Humanl M and 
Affymetrix 6.0 arrays for 20 Yorubans from Ibadan, 
Nigeria (YRI), 20 individuals of northern and western 
European ancestry living in Utah (CEU) and 20 indivi- 
duals of African ancestry from the Southwestern USA 
(ASW), were downloaded from the International Hap- 
Map project home page [29]. SNPs were merged using 
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the PLINK tool [31], to include only markers which 
were genotyped and passed the quality filters in both 
datasets. The final dataset comprised 653,498 SNPs. We 
also analyzed and dated admixture in Mandenka, Moza- 
bite, Bedouin, Palestinian and Druze groups from the 
CEPH-HGDP [18]. These groups were previously ana- 
lyzed via HAPMIX and reported to have European- 
related ancestry ranging from 2% to 97%, when analyzed 
using Africans and Europeans from the HapMap as the 
input reference populations [17]. These samples were 
genotyped for 650,000 SNPs on the Illumina platform 
[18]. The data were downloaded from the HGDP CEPH 
Genotype Database [32]. 

For the empirical admixture analyses, the parental 
groups are: the French and Yoruba for the admixed 
Mandenka, Mozabite, Bedouin, Palestinian and Druze 
groups; the YRI and CEU groups for the admixed ASW 
group; the BOR and MEL groups for the admixed PLY 
group; and the MEL and PLY groups for the admixed 
FIJ group. The StepPCO approach was first used to elu- 
cidate the local structure of the admixture signal for 
each admixed individual along each chromosome. We 
then estimated admixture proportions in each admixed 
group, and compared the StepPCO results for each 
chromosome to admixture proportions estimated using 
the maximum-likelihood based algorithm implemented 
in frappe [13]. We then applied wavelet transform analy- 
sis to the StepPCO signal and used the wavelet trans- 
form coefficients to infer time since admixture. After 
the wavelet transform coefficients were calculated we 
applied three filtering procedures to the signal. First, as 
explained previously, we replaced all coefficients smaller 
than an ascertained threshold value by zero, to remove 
low amplitude oscillations that are characteristic of 
noise (that is, wavelets of low height). This threshold 
value was chosen so that small oscillations present only 
within the distribution of the parental individuals are 
ignored. Second, we removed WT levels that correspond 
to the wavelets of the highest frequencies, which are 
also characteristic of noise (that is wavelets that are too 
narrow). Then we averaged the coefficients across each 
level and found a threshold amplitude, which is present 
in every individual whether admixed or not, and con- 
sider everything below it as noise (in effect this means 
that we subtract the parental signal, that is when we 
analyze the admixture signal in FIJ for example, with 
PLY and MEL being the ancestral populations, the fact 
that PLY themselves harbor Melanesian admixture has 
no effect on the inference of the admixture date for FIJ). 
Finally, we find the dominant frequency present in the 
signal (WT center) and use it to infer the time of 
admixture by comparing this observed dominant fre- 
quency to that obtained in simulated data generated 
using the admixture rate observed in the empirical data. 



African-Americans, Polynesians and Fijians 

StepPCO plots for one ASW, one PLY, and one FIJ are 
presented in Figure 5. The pattern of chromosomal seg- 
ments alternating between two ancestral states is char- 
acteristic of all admixed individuals and is observed on 
all chromosomes. For some chromosomal segments 
intermediate PCA1 values are observed, indicating that 
the admixed individual contains chromosomal segments 
from both parental populations (see Figure S5 in Addi- 
tional file 1 for StepPCO results for the other chromo- 
somes from these three individuals). As described in the 
Overview of the Method, the number of SNPs per slid- 
ing window of the StepPCO analysis is allowed to vary, 
in order to achieve reliable assignment of chromosomal 
segments in admixed individuals to the correct parental 
group. The average number of SNPs per StepPCO slid- 
ing window for chromosome 1, which contained a total 
of 42,499 SNPs after filtering, was: 280 for African 
Americans, 519 for Polynesians, and 1015 for Fiji. This 
variation reflects the different levels of differentiation 
between the ancestral populations of these three 
admixed groups. The largest average size of the sliding 
window is observed for Fiji, where PLY and MEL are 
used as parental groups. As the PLY themselves have 
Melanesian ancestry, PLY and MEL are much less dif- 
ferentiated than CEU and YRI, the parental populations 
of the African Americans. Hence more SNPs are needed 
to reliably assign chromosomal segments in the FIJ 
group to either of the ancestral populations, than are 
needed for the ASW. 

Average admixture proportions estimated by the 
StepPCO method for the African-Americans, Polyne- 
sians and Fijians are 19% European ancestry, 24.9% Mel- 
anesian ancestry, and 40.2% Melanesian ancestry 
respectively (Figure 6a). Individual admixture estimates 
vary substantially among the African-Americans, with 
some individuals exhibiting very low European ancestry 
(less than 5%), and some substantially higher (more 
than 40%). These results were substantiated by the 
frappe [13] analysis, which agree quite closely with the 
per-chromosome ancestry estimates from the StepPCO 
analysis (Figure 6b). A similar pattern is observed in Fiji, 
with Melanesian ancestry ranging from 22% to 63%. 
Despite the fact that the Polynesian sample is very 
diverse, coming from seven different islands [19], the 
level of Melanesian ancestry is much more uniform 
across individuals (varying from 18 to 28%). 

The spectral analysis of the StepPCO signal revealed 
that the average dominant frequency for the African- 
Americans is located at level 1.8, which would corre- 
spond to an abundance of low frequency wavelets (that 
is, wider ancestry blocks), while for the Fijians and the 
Polynesians the average dominant frequency is at level 
3.06 and 3.63 respectively, which is indicative of much 
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Figure 5 PCA and StepPCO results for chromosome 1. Solid lines centered around 1 and -1 indicate the mean PC1 coordinate for each 
parental population; progressively lighter shading surrounding the mean of each parental group indicate +/-1, +/-2 or +/-3 standard deviations 
from the mean, (a) Upper panel: PC1 vs PC2 for populations of CEU, YRI and ASW. Lower panel: Unphased chromosome 1 of an individual of 
African American ancestry; European (blue) and Yoruba (red) populations are used as parental groups, (b) Upper panel: PC1 vs PC2 for 
populations of MEL, BOR and PLY. Lower panel: Unphased chromosome 1 of an individual from Polynesia; Borneo (green) and New Guinean 
(orange) populations are used as parental groups, (c) Upper panel: PC1 vs PC2 for populations of MEL, PLY and FIJ. Lower panel: Unphased 
chromosome 1 of an individual from Fiji; Polynesia (brown) and New Guinean (orange) populations are used as parental groups. 
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(b) Frappe and StepPCO results 
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Figure 6 Admixture estimates, (a) Genome-wide admixture estimates based on StepPCO for African-Americans, Polynesians and Fijians. (b) 
Comparison of admixture estimates obtained via StepPCO vs. Frappe, for chromosome 1 for 20 African-Americans. 
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narrower ancestry blocks (Figure 7). Based on simula- 
tions, the WT center of 1.8 corresponds to an admixture 
time of 6 generations ago (95% CI: 4-8 generations) for 
the African Americans. Assuming a generation time of 
30 years [33], our results indicate that the admixture in 
the African Americans started about 180 years ago. 
Similarly, the simulations indicate that the WT center of 
3.63 for the Polynesians corresponds to an admixture 
time of 90 generations (95% CI: 77-131 generations), or 
about 2,700 years ago (Figure 8). The time estimation 
for Fiji is based on simulated data with a 40% admixture 
rate (to match the higher admixture rate of Fiji), and 
here the WT center of 3.06 corresponds to an admix- 
ture time of 37 generations (95% CI: 29-39) or about 
1,100 years ago. 
HGDP populations 

To test how the performance of our approach compares 
to HAPMIX, we applied our method to the Mandenka, 
Mozabite, Bedouin, Druze and Palestinian populations 
from the CEPH-HGDP [18], which were previously ana- 
lyzed using HAPMIX [17]. The HAPMIX estimates for 
the European-related ancestry in these populations ran- 
ged from 2% to 97%, when analyzed using Africans and 
Europeans from the HapMap as the input reference 
populations (Table 1). The time since admixture in 
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Figure 7 Average centers of the WT coefficients, calculated for 
each individual. 



these populations was inferred by calculating the num- 
ber of genomewide ancestry transitions (or the number 
of breakpoints), and the results are reported in Table 1. 

Although we estimated similar admixture proportions 
for these populations (Table 1), subsequent spectral ana- 
lysis of the admixture signal in the Mandenka, Mozabite, 
Bedouin, Palestinian and Druze revealed older admix- 
ture dates for the Mozabites and the Druze populations 
(Table 1 and Figure S6 in Additional file 1). The 
Bedouin population appears to be structured, with 
24 out of 45 individuals having a much higher propor- 
tion of European-related ancestry (Figure S7 in Addi- 
tional file 1). If these individuals are removed from the 
analysis, the estimate for the admixture time in the Bed- 
ouins changes to 97 generations ago (CI: 83-131). 

All programming and data analysis was performed 
using R (ver. 2.10.1) [34]. All scripts are freely available 
[35]. 

Conclusions 

Using genetic data to infer the time of migrations has 
always been difficult, and the time estimates obtained 
often come within wide confidence intervals, making 
these dates unreliable and inferences problematic. Here, 
we have introduced an approach that takes advantage of 
dense genome-wide SNP data to improve precision and 
reduce bias in making inferences about the timing of 
human migrations. By using an admixed population one 
can capitalize on the property of the genome to recom- 
bine each generation, producing chromosomes that are 
a mixture of the parental genetic material. The structure 
of an admixed genome contains temporal information 
about an admixture event, as a greater number and nar- 
rower width of ancestry blocks indicates more recombi- 
nation events, and hence greater time depth. 

Simulations indicate that the WT coefficients can be 
used to obtain accurate estimates of the time of admix- 
ture from suitable genome-wide SNP data. We therefore 
applied the method to three datasets, consisting of 
about 650,000 SNPs, to estimate the amount and time 
of admixture for three human populations: African- 
Americans, Polynesians, and Fijians. In addition, we ana- 
lyzed and dated admixture in five HGDP populations of 
African and Middle-Eastern origin. At first glance, it 
may appear that the simulated and empirical data differ 
in that the simulations used fully-differentiated popula- 
tions, which is not the case for the empirical data. How- 
ever, as explained in more detail in the Results and 
Discussion (Basic Setup section), the number of SNPs is 
adjusted in the StepPCO sliding windows until the 
ancestral populations can be statistically-differentiated, 
just as with the simulated data. 

For African- Americans, we estimate an average of 19% 
European ancestry, with a wide range of less than 5% to 
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Figure 8 Admixture time estimates for the African-Americans, Polynesians and Fijians. Simulated data from 100 simulations with a 20% 
and 40% migration rate are presented. Each curve represents a single admixed population. Average WT centers calculated for 100 chromosomes 
drawn at random from each population at exponentially growing time points are plotted as a function of time. Measurements obtained for the 
ASW, PLY and FIJ populations are shown by blue horizontal lines. Red vertical lines indicate the time estimate, and shaded boxes define the 
confidence intervals. Time estimate for ASW and PLY are based on simulations with a 20% admixture rate, while the time estimate for FIJ is 
based on simulations with a 40% admixture rate. 



Table 1 Comparison of results from HAPMIX and StepPCO 

Population Estimated percent Estimated percent Estimated time since admixture Estimated time since admixture 

European ancestry from European ancestry from (in gens ago) from HAPMIX (in gens ago) from StepPCO 





HAPMIX 


StepPCO 






Mandenka 


2% 


2% 


120 


121 


Mozabite 


77% 


84% 


100 


131 


Bedouin 


91% 


91% 


90 


83 


Palestinian 


93% 


92% 


75 


72 


Druze 


97% 


96% 


60 


90 



We have removed one outlier individual from the Mandenka population. Among the Bedouins, 24 out of 45 individuals have a higher proportion of European- 
related ancestry (Figure S7 in Additional file 1). If these individuals are removed from the analysis, the estimate for the admixture time in the Bedouins changes 
to 97 generations ago. Estimated percent European ancestry was calculated by using the two parental populations and an admixed group to find the PC1 axis. 
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more than 40% European ancestry across individuals. 
Both the average and the observation of a wide range of 
individual admixture estimates are in keeping with pre- 
vious studies [10,17,36,37]. The estimated time of 
admixture is about 180 years ago (95% CI: 120-240 
years ago), which is probably an underestimate since 
admixture in the African-American population is 
ongoing (implying that new ancestry blocks are being 
continuously introduced by new recombination events, 
which potentially removes older block structure by 
replacing narrower ancestry blocks with new, wider 
blocks). 

We tested the performance of the method on Fijians 
and Polynesians, as both populations are of admixed 
Asian and Melanesian ancestry [6]. Previous demo- 
graphic analyses of the genome-wide SNP data used in 
this study strongly support both an admixed Asian/Mel- 
anesian ancestry for Fijians and Polynesians as well as 
subsequent additional gene flow from Melanesia into 
Fiji, but not Polynesia [19]. Based on this previously 
established scenario, we estimated an average of about 
25% (from 18 to 28%) Melanesian ancestry in Polyne- 
sians, in good agreement with previous estimates based 
on the same [19] or other [6,38-40] data. The estimated 
time of admixture is about 90 generations ago, or 2,700 
years (95% CI: 2,300-3,900 years), in good agreement 
with a previous estimate of about 3,000 years ago based 
on an ABC simulation approach for the same data [19]. 
For Fiji, the estimated amount of Melanesian ancestry 
was about 40%, and the time for this admixture is esti- 
mated to have occurred about 37 generations ago, or 
1,100 years (95% CI: 870-1170 years). An ABC-simula- 
tion based approach for the same data gave an estimated 
date of 62 generations for this admixture in Fijians, 
about twice as long ago as our estimate. We speculate 
that, as in the case of the African- Americans, the esti- 
mate based on WT coefficients may be biased toward 
more recent dates if the gene flow to Fiji occurred over 
a period of time, as more recent gene flow replaces 
older, narrower ancestry blocks with newer, wider 
ancestry blocks. Individual Melanesian ancestry esti- 
mates are much wider for Fiji (from 22 to 63%) than for 
Polynesia (from 18 to 28%), which may indeed indicate 
a longer period of gene flow into Fiji. 

Our results for the Mozabite, Mandenka, Bedouin, 
Druze and Palestinian populations are similar to those 
for HAPMIX for inferring local ancestry, and in addition 
our method seems to perform better with respect to 
more ancient admixture events (as also shown with 
simulated data: Figure 4). In particular, we dated the 
admixture event in the Mozabites and the Druze to 131 
and 90 generations ago respectively, 30 generations 
more than the corresponding estimates obtained with 
HAPMIX [17]. HAPMIX estimation of the time since 



admixture is based on the number of calculated ancestry 
transitions (that is, the number of breakpoints); both our 
simulations and previous simulations [17] indicate that 
infinite size populations the number of breakpoints does 
not increase with time according to expectations (see 
Equation 1 above), but rather stabilizes, leading to 
underestimates in admixture dates (Figure 2b). Further- 
more, because human populations are closely related 
and not very well differentiated, direct estimation of the 
number of breakpoints and block width as a measure of 
time since admixture for human genetic data is proble- 
matic for two reasons. Firstly, to have enough power to 
reliably assign chromosomal segments to an ancestral 
population, it is necessary to use relatively large geno- 
mic windows, which correspondingly reduces detection 
of closely-spaced breakpoints. And secondly, for every 
location in the genome that potentially carries a break- 
point, a formal decision has to be made as to whether 
to consider it a true breakpoint or not. This transforma- 
tion of the 'raw' signal into a discrete signal potentially 
leads to either some not well-defined breakpoints being 
overlooked, or conversely random effects becoming 
inflated and falsely considered as a true signal. These 
errors, however small, will accumulate over the many 
measurements taken. Conversely, the spectral analysis 
approach implemented here does not require any data 
transformation and is applied to the 'raw' signal directly. 
This has the advantage of preserving the statistical nat- 
ure of the signal until the final averaging step, and thus 
does not involve detection of exact location (and pre- 
sence) of breakpoints, where inevitably large errors in 
estimation could occur. Although we followed Price et 
al. 2009 in using African and European parental groups 
for the admixed Mozabite, Mandenka, Bedouin, Druze 
and Palestinian groups from the CEPH-HGDP, in fact 
previous studies have shown that the Druze, Bedouin 
and Palestinian populations are admixed primarily along 
a European-Central Asian axis, with little African 
admixture, and the Mandenka exhibit very little Eur- 
opean admixture [18,41]. Here, we report dates for the 
presumptive European gene flow, to compare our results 
to the previous study [17], but it is important to keep in 
mind that our method (like all admixture methods) 
requires the use of pre-defined parental groups. Incor- 
rect identification of the ancestral groups contributing 
to an admixed group will obviously lead to erroneous 
conclusions, hence careful attention must be paid when 
identifying parental groups. This is especially true for 
groups that are suggested to have experienced admix- 
ture a long time ago, and hence had more time to 
experience genetic drift (which is always expected to act 
in a direction orthogonal to the axis of admixture). In 
such cases, it is difficult to distinguish between an 
admixed population that has been subject to genetic 
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drift, and a population that has experienced admixture 
along a different axis of variation. 

Theoretically, there are no limitations as to how far back 
in time one can get good estimates of admixture time with 
WT coefficients. The performance of the method is influ- 
enced by two factors: the density of SNPs analyzed and 
the degree of differentiation between the two parental 
populations. Increasing SNP density would allow the esti- 
mated time horizon for detecting admixture to be moved 
further back. We therefore expect that full sequence data 
will increase the sensitivity and resolution of our method. 
The analysis presented here was based on about 650,000 
markers; the current estimates for the number of SNPs in 
the human genome is around 15 million SNPs [42]. Full 
sequence data will thus provide a twentyfold increase in 
SNP density, and thereby allow for a twentyfold reduction 
in the size of the sliding window. Thus, assuming that the 
newly added SNPs are no less informative for population 
differentiation than the SNPs on the Affymetrix arrays, we 
expect that analysis of full sequence data should offer at 
least a twentyfold improvement in the potential time 
depth for admixture estimates for human populations. 
However, given that there is relatively little genetic differ- 
entiation between human populations, to distinguish 
among parental populations requires relatively large seg- 
ments of the genome, and this also poses a restriction on 
the time depth of the method. The more closely-related 
the parental populations, the larger the window size 
needed to span a sufficient number of informative SNPs. 
Obviously, this limitation will persist regardless of the type 
of molecular data considered. However, because of the 
strong ascertainment bias associated with the SNPs geno- 
typed on the arrays, we expect that SNP-data generated 
using the array technology necessarily underestimates the 
variation that exists between human populations, and 
recent studies suggest that this underestimation could be 
considerable [43]. Moreover, the method introduced here 
can be used with any species for which suitable genome- 
wide data exist, and the larger the genetic difference 
among the parental populations, the less genome-wide 
data needed for accurate admixture estimates. 

An additional advantage of the StepPCO method is that 
it provides an estimate of the admixture proportions for 
each individual within an admixed population. Individual- 
level estimates of admixture obtained here via StepPCO 
for African- Americans, Polynesians, and Fijians were quite 
similar to those obtained via the maximum-likelihood 
based approach frappe [13], indicating that StepPCO gives 
reliable results. Furthermore, the StepPCO method also 
provides information about the distribution of admixture 
along each chromosome. As such, this approach is also 
promising for disease gene mapping (in recently admixed 
populations), and for studying local selection. According 
to neutral expectations the admixture level should be 



constant along the genome, but a locus favored by positive 
selection in the admixed population should appear to have 
greater admixture proportions than would be expected 
from the genome-wide average. We are currently investi- 
gating the utility of this approach for identifying candidate 
genes subject to local selection. 

In conclusion, we have shown that wavelet transfor- 
mation is a useful and novel means of dating admixture 
events from genome-wide data. Other potential exten- 
sions of the methodology introduced here include 
admixture scenarios involving more than two parental 
populations, and implementing spectral analysis of the 
raw genomic signal directly, rather than from the 
StepPCO signal. There is potentially much more to be 
learned from surfing the wavelets of the genome. 

Materials and methods 

Basic setup 

We consider a collection of N SNPs along a chromo- 
some, which are ordered by their physical position, and 
indexed by numbers from 1 to N. For such a collection 

of SNPs we consider a vector p := (p.) of each SNP's 

physical position along the chromosome scaled with 
respect to the genetic distances, which were interpolated 
from genome-wide recombination rates, estimated as 
part of the HapMap project [24]. Suppose e is an indivi- 
dual chromosome. We also denote e := {e { )f =1 as a col- 
lection of calls at given SNP positions for the 
chromosome e. Each component in the vector e for 
each individual chromosome takes value 0, 1, -1 or NA, 
where -1 and 1 are assigned to the two possible homo- 
zygotes of a given SNP and 0 corresponds to a heterozy- 
gote. The space of all possible calls (excluding NAs) sits 
as a discrete set in an TV-dimensional real vector- space E 
with canonical basis. 

We then consider a sliding window along each chro- 
mosome. Such window w refers to a contiguous sub- 
range of SNPs, that is: 

w = { w first , iu first + 1, u; last - 1, w last } . 

The width of a window is defined by the genetic dis- 
tance between the last and the first SNP in the window: 



Suppose we are given an oriented line A in the vector- 
space E (later A will be taken to be a principal axis of a 
certain collection of individual chromosomes). Line A is 
spanned by a single non-zero vector with coordinates 

( a *)i^-i defined U P to multiplication by a positive 
number. 
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For a given window w and an individual chromosome 
e, define a measurement associated with this window 
with respect to the axis A by: 

Yj a i e i 



Obviously, the resulting value does not depend on the 
choice of a vector spanning A 

Given a point x along the chromosome, we say that a 
window w is centered at a point x and has a width / if it 
consists of exactly those SNPs that lie within the dis- 
tance 1/2 from x, that is: 



w,(x) : 



ju; \x-p w |<lj. 



Consider sample collections {p^ and {q^ from two popu- 
lations P and Q, respectively. Let a = (a 7 ) ^ define the 

principal axis in SNP-coordinates for a collection of sam- 
ples {p /c ,q 7< }. Given a point x 0 along the chromosome and a 
window w centered at this point, consider measurements: 

a k := M w (p k ) and p k := M^{q k ). 

Given a positive real number A > 0, we say that popu- 
lations P and Q are A -separated in a window w if: 



|Ea-E/?|>A|a(cO + o-(£)| 



(2) 



where E stands for the mean and o for the standard 
deviation estimators of the PCI coordinate of popula- 
tions P and Q. 

Finally, a window w is called A-optimal if it has the smal- 
lest size, such that populations P and Q are A-separated in 
w (that is the smallest possible window that satisfies Con- 
dition 2). The size of this window at each position is cho- 
sen so that the ancestral populations are sufficiently well 
separated by the statistical properties of the collection of 
SNPs in the window. We set A = 3, that is we consider a 
window as optimal, when the mean PCI coordinates for 
the parental populations are separated by three standard 
deviations from each mean. Note that in general the opti- 
mal size will depend on the position of the window along 
the chromosome. In effect, by taking smaller windows in 
chromosomal regions that contain more informative 
SNPs, we are able to increase signal resolution without 
introducing excess errors into the ancestry estimation. 

In summary, for the sample collections {p 7 } <= P and 
{q 7 } c Q we construct the following data: a) the princi- 
pal axis A spanned by a non-zero vector with coordi- 
nates (flj)^ f° r a collection of samples {p 7 , q 7 }; b) a 
collection of K equispaced points {x k } k=1 along the 



chromosome; and c) for each point x k , k = 1,..., K we 
construct a A-optimal window centered at x k . These 
data we refer to as a StepPCO frame. (Special care 
needs to be taken so that the windows cover the entire 
chromosome, leaving no gaps in between. This can 
always be achieved by choosing the number of bins K 
sufficiently large.) 

StepPCO 

Using coefficients of PA1 as weights, we find the average 
value of SNPs within a window. The resulting values are 
then normalized, so that the ancestral populations corre- 
spond to values with means of 1 and -1, respectively. 

Consider collections {p 7 } and { q 7 } of samples from 
two ancestral populations P and Q and a set of admixed 
individuals R = {r 7 }. Construct a StepPCO frame (in our 
applications A = 3 and number of bins is K = 1,024). 

For an individual chromosome r from population R, 
define the StepPCO signal as a vector of measurements: 



6(r) :=(M V 



>))L- 



Each component of the vector 6(r) characterizes a 
stretch of chromosome r corresponding to each bin as 
belonging to one of the ancestral populations, and the 
confidence of such an assignment. The level of confi- 
dence will of course depend on A. 

Suppose the principal axis is directed from P to Q. 
Then the closer the value of the signal to 1, the more 
likely that the corresponding stretch of the chromosome 
of the sample r belongs to ancestral population Q. 

Analogously, values close to -1 correspond to the 
stretches of DNA most likely tracing their ancestry to 
population P. 

Wavelet transform 

Consider a 2 L -dimensional vector-space y _ ^ L with 
the scalar product: 



{{ u M v i)) : =^ u i v i- 



Wavelets (co^ p ) are the orthogonal system of 2 L - 1 
vectors in V . They are indexed by the level / = 1,..., L 
and the position p = 1,..., 2 /_1 . The level corresponds to 
a particular frequency of the wave, and is up to an addi- 
tive constant the negative logarithm base 2 of the period 
(see formula 3), while the position denotes the position 
of the wavelet of the given frequency within the signal. 
Each wavelet at level / has a support of the size 
and is a rectangular wave of amplitude 1 and zero 
average. 
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The coefficients of a wavelet (co^ p ) are given by: 



1 if (p - l)2 L " Z+1 + 1 < k < (p - l)2 L " Z+1 + 2 L ~ l , 

-l if (p - 1)2 L ~ z+1 + 2 L " z + 1 < A: < p2 L " z+1 , 
o otherwise. 



Together with the vector with all coefficients equal to 
1, wavelets form an orthogonal basis of V. Suppose 

y = (y .)?_^ is a signal with a discrete time, that is a 2 L - 

dimensional real vector from V. Its wavelet coefficients 
are simply coefficients of / with respect to the wavelet 
basis. They could be efficiently evaluated by passing y 
through a series of filters (linear operators) obtaining at 
each step: i) wavelet coefficients for a given level, and ii) 
a downsampled signal to which the next round of eva- 
luation is to be applied: 



For A: = i,...,2 



Fork = i,...,2 



low pass filter, 



wtL.fcW : = -0 2 fc -r 2 fc-i) high pass filter, 



7fc := 



low pass filter, 



wt L-i,k(.r) : = -(r'zk-Yzk-i) high pass filter. 



As a result, one obtains 2 1 wavelet coefficients and one 
additional value: of the last downsampled signal (/average)- 
This /average corresponds to the average of y. Note that the 
level / of a wavelet is defined as the logarithm base 1/2 of 
the halfperiod p of the wave, where period is measured as 
a fraction of the length of the entire signal: 



P = 



1 2 



L+i-l 



-2 or / = log 1 p. 



(3) 



Thus, to compare wavelet coefficients of signals corre- 
sponding to some physical phenomena, levels should be 
shifted by the logarithm base 1/2 of the physical length 
of the signal. 

Wavelet transform of the discrete signal is lossless 
[22], and the original signal could be recovered from its 
wavelet coefficients and its average, as: 



7 = 7, 



average 



1 

V J 



l,k 



We will refer to this as inverse wavelet transform and 
write: 



y = iwt(wt lk ,y 



average 

For the noise reduction, one decides which wavelet 
amplitudes at each frequency and location are to be 



considered unimportant, and filters the wavelet coeffi- 
cients removing (setting to zero) corresponding coeffi- 
cients. Inverse wavelet transform then produces a signal 
with the 'noise' removed. An example of this procedure 
is shown in Figure S8 in Additional file 1 where random 
noise is added to the signal and then removed using 
procedure described above. 

That is, given a collection t = {t t) k ) of threshold levels, 
one for each of the wavelets in the wavelet decomposi- 
tion, define a noise filter % on the set of wavelet coeffi- 
cients by: 

wt = T (wt), where 

[o if |wt^|<f^, 
wtz,fc=i 

wt; k otherwise. 

Given a signal containing noise, one finds its wavelet 
coefficients, applies a noise filter X to the set, and uses 
an inverse wavelet transform to recover the signal with- 
out the noise. The cleaned signal then is: 

/cleaned = iwt(X(wt(y))). 

To assess the spectral characteristics of the signal we 
use the following procedure: Suppose y is a discrete sig- 
nal as above, and wt = (wt/, k ) is the collection of its 
wavelet coefficients. First we calculate the so called 
wavelet summary by averaging absolute values of wave- 
let coefficients at each level. That is: 



r 

s l •- 



wt 



l,k I 



l-i 



1 = 1,. ..L. 



Each (non-negative) number s t shows the abundance 
of the corresponding wavelet frequency in the signal. 
Finally, the dominant frequency present in the signal, so 
called center, is evaluated as follows: 



C(y) = 



Thus, C(y) is a 'central' frequency of the signal y, and 
is used as an indirect measure of the average width of 
the admixture blocks. 

Additional material 



Additional file 1: Supplementary figures. Additional file includes eight 
supplemental figures. 
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